SUMMARY


This document provides a summary of work completed by General Dynamics under
work unit 71840871, Speech Interfaces for Multinational Collaboration, for the period August
2004 to November 2009 under contract FA8650-04-C-6443. The speech technologies developed
during this period include speech recognizers, Articulatory Feature (AF) detectors, and speech
synthesizers. Speech recognition systems were developed for 15 different languages, and three
methods were investigated for improving the performance of the systems: vocal tract length
normalization, speaker adaptive training, and recognizer output voting error reduction. English
AF detectors were developed using Gaussian Mixture Models (GMMs), two-class Multi-Layer
Perceptrons (MLPs), fusion MLPs, and multi-class MLPs. The outputs of the AF detectors were
used to form the feature set for a speech recognizer. Speech synthesis systems were created for
13 different languages, and the following system modifications were investigated: expanding the
label set to include additional contextual factors, changing the minimum description length
control factor, and applying speaker clustering and adaptation to create new voices. In addition,
two graphical user interfaces were developed for training new voices and synthesizing speech in
real-time.

The author would like to acknowledge the following groups: (1) Army Research
Laboratory for the Dari speech corpus, (2) Cambridge University for their Hidden Markov
Model ToolKit (HTK), (3) Carnegie Mellon University and Cambridge University for their
Statistical Language Modeling Toolkit, (4) the Julius project team at Nagoya Institute of
Technology for their Julius speech recognition engine, (5) Bryan Pellom of the University of
Colorado for the SONIC speech recognizer, (6) the speech processing group at the Brno
University of Technology, Faculty of Information Technology for their SRover software, (7) MIT
Lincoln Laboratory for their GMM software package, (8) the International Computer Science
Institute for their QuickNet software package, (9) Joe Frankel et al. for their AF classifiers, (10)
Simon King et al. for the SVitchboard 1 corpus, (11) Karen Livescu et al. for their manual AF
transcriptions, (12) the HTS working group for their HMM-based Speech Synthesis Toolkit, (13)
the SPTK working group at Nagoya Institute of Technology for their Speech Signal Processing
ToolKit, and (14) KTH for the Snack Sound Toolkit.

1.0 INTRODUCTION


This document provides a summary of work completed by General Dynamics under
work unit 71840871, Speech Interfaces for Multinational Collaboration, for the period August
2004 to November 2009 under contract FA8650-04-C-6443. Section 2 describes how
speech recognition systems were developed for 15 different languages, and presents three
methods that were investigated for improving the performance of these systems. Section 3
describes how articulatory feature detectors were created for English and applied to speech
recognition tasks in English, Russian, and Dari. Section 4 describes how speech synthesis
systems were developed for 13 different languages, and provides a brief overview of two
graphical user interfaces that were developed for creating new voices and synthesizing speech.
Finally, Section 5 summarizes the work completed and provides recommendations for future
research.

2.0 SPEECH RECOGNITION IN 15 LANGUAGES


Speech  recognition  systems  were  developed  for  15  different  languages  using  the  Hidden 
Markov  Model  (HMM)  ToolKit  (HTK).  This  chapter  discusses  these  recognition  systems  and 
presents  three  methods  that  were  investigated  to  improve  the  performance  of  these  systems: 
Vocal  Tract  Length  Normalization  (VTLN),  Speaker  Adaptive  Training  (SAT),  and  the  ROVER 
technique.  Section  2.1  provides  an  overview  of  the  baseline  recognition  systems  developed  for 
each  language.  Section  2.2  discusses  VTLN  and  presents  results  obtained  on  English,  Mandarin, 
and  Russian.  Section  2.3  provides  an  overview  of  SAT  and  presents  results  obtained  on  Russian 
and  Dari.  Lastly,  Section  2.4  describes  the  ROVER  technique. 

2.1  Baseline  Recognition  Systems 

This section discusses the baseline speech recognition systems that were developed for
Arabic, Croatian, Dari, English, French, German, Japanese, Korean, Mandarin, Pashto, Russian,
Spanish, Tagalog, Turkish, and Urdu. A total of seven different corpora were used to obtain
coverage of all 15 languages: the Topic Detection and Tracking (TDT4) Multilingual
Broadcast News corpus [1], Phase II of the Wall Street Journal (WSJ1) corpus [2], CALLHOME
Mandarin Chinese [3], HUB4 Mandarin Broadcast News Speech [4], GlobalPhone [5], the
Language And Speech Exploitation Resources (LASER) Advanced Concept Technology
Demonstration corpus, and the ARL Dari corpus. The TDT4, WSJ1, CALLHOME, and HUB4
corpora are available from the Linguistic Data Consortium, and the ARL Dari corpus was
collected by Army Research Laboratory with support from AFRL. Table 1 lists the corpora used
for each language, the speaking style of each corpus, the total amount of training data (in hours)
used to develop the recognizers, and the vocabulary size.

Table 1: Overview of Corpora

Language    Corpus        Speaking Style    Hours    Vocabulary Size
Arabic      TDT4          Broadcast News    37       47k
Croatian    GlobalPhone   Read              12       22k
Dari        ARL           Read              20       2k
English     WSJ1          Read              18       10k
French      GlobalPhone   Read              20       21k
German      GlobalPhone   Read              14       23k
Japanese    GlobalPhone   Read              26       18k
Korean      GlobalPhone   Read              16       50k
Mandarin    CALLHOME      Conversational    26       8k
Mandarin    HUB4          Broadcast News    30       18k
Pashto      LASER         Read              17       6k
Russian     GlobalPhone   Read              18       29k
Spanish     GlobalPhone   Read              17       19k
Tagalog     LASER         Read              9        5k
Turkish     GlobalPhone   Read              13       15k
Urdu        LASER         Read              45       8k

HMM-based recognition systems were trained for each language using HTK [6].¹ The
feature set consisted of 12 Mel-Frequency Cepstral Coefficients (MFCCs), with cepstral mean
subtraction, plus an energy feature. Delta and acceleration coefficients were also included to
form a 39-dimensional feature set. The acoustic models were state-clustered cross-word
triphones. All HMMs included three states, with diagonal covariance matrices, and the state
clustering was performed using a decision tree. An average of 16 mixture components was used
for each HMM state.
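
As a rough illustration of how the 39-dimensional vector is assembled, the sketch below appends delta and acceleration coefficients to 13 static coefficients (12 MFCCs plus energy). The regression formula, the window of two frames, and the edge handling are assumptions based on common practice; the report does not state them, and the function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients (assumed window of +/-2 frames).

    Frames beyond the start or end of the utterance are replaced by the
    first or last frame, respectively.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Concatenate static, delta, and acceleration coefficients per frame."""
    d1 = deltas(static)   # delta coefficients
    d2 = deltas(d1)       # acceleration coefficients (deltas of deltas)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames of 13 static coefficients forming a linear ramp.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
```

Each frame in `full` has 13 + 13 + 13 = 39 entries, matching the dimensionality quoted above.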

Trigram Language Models (LMs) were created for each language using the Carnegie
Mellon University (CMU)-Cambridge Toolkit [7].² The LM probabilities were estimated using
the train partition of each language, but the vocabulary was expanded to include all words in the
corpus. Decoding was performed using both the HTK decoder HDecode and the Julius decoder
[8].³ The Word Error Rates (WERs) for each language are shown in Figure 1. HDecode yielded
better performance than Julius in all languages.
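
For readers unfamiliar with the metric, the following sketch computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It is a generic illustration, not the scoring tool used in this work; the function name is hypothetical, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In this toy case there is one substitution and one deletion against six reference words. For Mandarin, the same computation applied to characters instead of words gives the character error rate.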


1 Available at http://htk.eng.cam.ac.uk

2 Available at http://www.speech.cs.cmu.edu

3 Available at http://julius.sourceforge.jp

[Figure 1: bar chart with one pair of bars (HDecode, Julius) per language; axis: Word Error Rate, 0% to 70%. Mandarin is shown separately for CALLHOME and HUB4.]

Figure 1: WER for each Language (HDecode and Julius); (*Mandarin is expressed in
character error rate)

2.2 Vocal Tract Length Normalization

Vocal Tract Length Normalization (VTLN) attempts to compensate for different vocal
tract lengths by linearly warping the frequency axis when performing filterbank analysis.
Warping factors α for each speaker in the training set were selected using the following
procedure [9]. First, single-mixture monophone HMMs with non-normalized MFCC features⁴
were estimated from the complete training set of all speakers. Next, each utterance was
phonemically aligned using the non-normalized HMMs and MFCC features computed using
warping factors α = 0.80, 0.82, 0.84, ..., 1.20. The value of α that gave the maximum score was
selected for each speaker. Lastly, multiple-mixture triphone HMMs were estimated from the
complete training set using the normalized MFCC features.
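
The warping-factor search described above amounts to a grid search over 21 candidate values. The sketch below shows only that selection step; `toy_score` is a hypothetical stand-in for the alignment score that would come from forced alignment against the monophone HMMs, and none of the names are taken from the toolkit.

```python
def warp_grid(lo=0.80, hi=1.20, step=0.02):
    """Candidate warping factors 0.80, 0.82, ..., 1.20."""
    n = round((hi - lo) / step)
    return [round(lo + i * step, 2) for i in range(n + 1)]


def select_warp_factor(alignment_score, alphas):
    """Return the candidate factor with the highest alignment score."""
    return max(alphas, key=alignment_score)


def toy_score(alpha):
    # Hypothetical score that peaks at 0.94; a real system would re-extract
    # features with this warping factor and return the alignment likelihood.
    return -(alpha - 0.94) ** 2


alphas = warp_grid()
best = select_warp_factor(toy_score, alphas)
```

In practice the score function dominates the cost, since each candidate factor requires a new feature extraction and alignment pass.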

The procedure used to select the warping factor α for each utterance in the test set can be
summarized as follows. First, non-normalized multiple-mixture triphone HMMs with
non-normalized MFCC features were used to hypothesize the word sequence for the utterance. Next,
the utterance was phonemically aligned using the normalized single-mixture monophone HMMs
and MFCC features computed using warping factors α = 0.80, 0.82, 0.84, ..., 1.20. The value of α
that gave the maximum score was selected for the utterance. Lastly, the normalized
multiple-mixture triphone HMMs and normalized MFCC features were used to hypothesize the word
sequence. The VTLN procedure was evaluated on the WSJ1 English, CALLHOME Mandarin,
and GlobalPhone Russian corpora. The results for each language are shown in Table 2. Applying VTLN
reduced the error rate by 1.0% absolute on English, 1.7% on Mandarin, and 0.3% on Russian.

4 The term normalization is used here to refer to MFCC features computed from a warped filterbank using a

Table 2: WER for English and Russian, and Character Error Rate for Mandarin

Language    No VTLN    With VTLN
English     11.8%      10.8%
Mandarin    65.1%      63.4%
Russian     29.6%      29.3%

2.3 Speaker Adaptive Training

Speaker Adaptive Training (SAT) is a technique used to train Speaker Independent (SI)
acoustic models that integrates speaker normalization as part of the model estimation procedure.
The procedure used to implement SAT can be summarized as follows. First, multiple-mixture
triphone HMMs were estimated from the complete training set of all speakers. Next,
Constrained Maximum Likelihood Linear Regression (CMLLR)⁵ was used to compute a set of
linear transformations for each speaker. Lastly, the SI models were re-estimated using the
speaker transforms to adapt the training features. This procedure was repeated three times to
train the final model.
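
To make the feature-adaptation step concrete, the sketch below applies a per-speaker affine transform x' = Ax + b to each feature vector, which is the general form of a CMLLR feature transform. Estimating A and b (by maximizing the likelihood under the current models) is left to the toolkit and is not shown; the speaker label, matrix values, and function name here are purely illustrative.

```python
def apply_cmllr(frames, A, b):
    """Apply the affine transform x' = A x + b to every feature vector."""
    return [[sum(a * x for a, x in zip(row, frame)) + bias
             for row, bias in zip(A, b)]
            for frame in frames]


# Hypothetical two-dimensional transform for one speaker.
transforms = {"spk1": ([[1.0, 0.0],
                        [0.0, 2.0]], [0.5, -1.0])}

frames = [[1.0, 1.0], [2.0, 3.0]]
A, b = transforms["spk1"]
adapted = apply_cmllr(frames, A, b)
```

During SAT, the adapted features of every training speaker are pooled to re-estimate the shared models, and the transforms are then re-estimated against the updated models.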

The decoding procedure can be summarized as follows. First, the original SI acoustic
models were used to hypothesize the word sequence for each utterance. Next, each utterance
was phonemically aligned using the SI acoustic models. These alignments were used to compute
a single set of CMLLR transforms for each speaker using the SAT models. Lastly, the SAT
models and CMLLR transforms were used to hypothesize the word sequence for each utterance.
The SAT technique was evaluated on the GlobalPhone Russian and ARL Dari corpora. The results are
shown in Table 3. Applying SAT reduced the WER by 4.5% absolute on Russian and 3.1% absolute on Dari.

Table 3: SAT for Russian and Dari

Language    No SAT    With SAT
Russian     29.6%     25.1%
Dari        26.6%     23.5%

5 CMLLR is a feature adaptation technique that shifts the feature vectors such that each HMM state in the model is
more likely to have generated the features.

2.4 ROVER


Recognizer Output Voting Error Reduction (ROVER) [10] is a technique for combining
the hypothesized word sequences from multiple recognizers. The ROVER technique first aligns
the word sequences output from the different recognizers and then selects the final word
sequence according to the frequency of occurrence. This technique was evaluated on 12
different languages using the hypothesized word sequences from the HDecode, Julius, and
SONIC [11] decoders. The SRover program from the Brno University of Technology⁶ was used
to apply ROVER. Figure 2 shows the error rates obtained on each language. An improvement in
system performance was obtained on all languages except English. Compared to the best
individual system, the largest decrease in WER was 2.4% on French.
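
The voting step can be sketched as follows, under the simplifying assumption that the hypotheses have already been aligned slot by slot, with an empty string marking a slot where a recognizer produced no word. The alignment itself, which ROVER performs before voting, is not shown. The tie-breaking rule (prefer the earliest-listed system) and all names are assumptions made for illustration and do not describe the SRover implementation.

```python
from collections import Counter

NULL = ""  # marks a slot where a recognizer emitted no word


def vote(aligned_hyps):
    """Pick the most frequent entry in each aligned slot."""
    result = []
    for slot in zip(*aligned_hyps):
        counts = Counter(slot)
        top = max(counts.values())
        # Tie-break: take the entry from the earliest-listed system
        # among those with the highest count.
        word = next(w for w in slot if counts[w] == top)
        if word != NULL:
            result.append(word)
    return result


hyps = [
    ["the", "cat", "sat", NULL, "down"],   # e.g. first decoder
    ["the", "cat", "sat", "on", "down"],   # e.g. second decoder
    ["a",   "cat", "sad", NULL, "down"],   # e.g. third decoder
]
combined = vote(hyps)
```

Here the combined output drops the word "on" because two of the three systems left that slot empty, and keeps "sat" over "sad" by majority.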


6 Available at http://speech.fit.vutbr.cz

[Figure 2: bar chart with one group of bars (HDecode, Julius, SONIC, ROVER) per language.]

Figure 2: WER for each Language (HDecode, Julius, SONIC and ROVER); (*Mandarin is
expressed in character error rate)

3.0 ARTICULATORY FEATURE DETECTION


Articulatory Features (AFs) describe the way in which speech sounds are produced. One
of the most popular methods for classifying speech sounds using AFs is the International
Phonetic Alphabet (IPA) [12]. Consonants are defined by AFs that describe the place of
articulation, manner of articulation, and voicing status. Vowels are classified using AFs that
describe both the tongue position and the shape of the lips. This chapter discusses two methods
that were investigated for detecting English AFs. Section 3.1 describes how fusion-based AF
detectors were created using Gaussian Mixture Models (GMMs) and two-class Multi-Layer
Perceptrons (MLPs). Section 3.2 describes how multi-class MLPs were developed for English
and incorporated into a Russian and Dari speech recognizer.

3.1  Fusion-based  AF  Detectors 

This section discusses how fusion-based AF detectors were created for English and used
in an HMM-based phoneme recognizer. Sections 3.1.1 and 3.1.2 describe how GMMs and
MLPs were used to create AF detectors. Section 3.1.3 discusses two different procedures that
were investigated for fusing the scores from the GMMs and MLPs, and presents results obtained
on TIMIT. Lastly, Section 3.1.4 presents results obtained on the CSLU Multi-language
Telephone corpus. Table 4 lists the AFs used to describe English speech sounds, with the
exception of silence (34), where the number in parentheses indicates the feature number.


Table 4: AFs for English Consonants and Vowels

CONSONANTS (0)
  Place             bilabial (1), labiodental (2), labialvelar (3), dental (4), alveolar (5),
                    postalveolar (6), retroflex (7), palatal (8), velar (9), glottal (10)
  Manner            plosive (11), nasal (12), tap or flap (13), fricative (14),
                    approximant (15), lateral approximant (16), affricate (17)
  Voicing           voiced (18), voiceless (19)

VOWELS (20)
  Tongue Height     close (21), near-close (22), mid (23), open-mid (24),
                    near-open (25), open (26)
  Tongue Fronting   front (27), near-front (28), central (29),
                    near-back (30), back (31)
  Lip Shape         rounded (32), unrounded (33)

3.1.1 GMM-based AF Detectors. GMM-based AF detectors were trained on the WSJ1
corpus using the GMM software package from MIT Lincoln Laboratory [13]. For each AF, a
GMM was trained using frames where the feature was present, and a second GMM was trained
using frames where the feature was absent. All models used 256 mixture components with
diagonal covariance matrices. The feature set consisted of 12 MFCCs, with cepstral mean
subtraction, plus an energy feature. Delta and acceleration coefficients were also included to
form a 39-dimensional feature vector.

The scores for each AF were calculated as follows. Denote the presence of an AF as f
and the absence of an AF as g. If we consider the speech feature vector x, then

$$\log \frac{P(f \mid x)}{P(g \mid x)} = \log p(x \mid f) - \log p(x \mid g) + \log p(f) - \log p(g) \qquad (1)$$

The probabilities p(x|f) and p(x|g) were calculated from the feature-present and feature-absent
GMMs, respectively. The probabilities p(f) and p(g) were estimated from the training data by
counting the occurrences of each AF.
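
Equation (1) can be evaluated directly from two diagonal-covariance GMMs and the prior of the feature being present. The sketch below is a minimal pure-Python illustration with single-component toy models; it is not the MIT Lincoln Laboratory package, the function names and parameter layout are hypothetical, and it assumes p(g) = 1 - p(f) because presence and absence are complementary.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood under a GMM, using log-sum-exp for stability."""
    terms = [math.log(w) + log_gauss_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def af_score(x, gmm_present, gmm_absent, prior_present):
    """Log posterior ratio of Equation (1)."""
    return (gmm_loglik(x, *gmm_present) - gmm_loglik(x, *gmm_absent)
            + math.log(prior_present) - math.log(1.0 - prior_present))


# Toy one-dimensional, single-component models: (weights, means, variances).
present = ([1.0], [[0.0]], [[1.0]])
absent = ([1.0], [[2.0]], [[1.0]])
score_mid = af_score([1.0], present, absent, prior_present=0.5)
score_near = af_score([0.0], present, absent, prior_present=0.5)
```

A positive score favors the feature being present; a vector equidistant from both models with equal priors scores zero.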

3.1.2 MLP-based AF Detectors. MLP-based AF detectors were trained on the WSJ1
corpus using the ICSI QuickNet software package.⁷ A three-layered MLP (input: 39 units,
hidden: 100 units, output: 2 units) was used to model each AF. The same MFCC feature set
described in Section 3.1.1 was used as the input, and sigmoid activation functions were used on
the hidden layer. The softmax function was used as the output activation function during
training; however, it was removed when scoring the MLPs so that the outputs more closely
approximated a Gaussian distribution. The final score for each AF was calculated by subtracting
the output of the absent unit from the output of the present unit.
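
One way to see what this score represents: for a two-unit output layer, the difference of the pre-softmax outputs equals the log-odds of the softmax posteriors, so the score is unbounded rather than confined to the interval from 0 to 1. The sketch below checks this numerically. The output values and the (present, absent) ordering are hypothetical and not taken from QuickNet.

```python
import math


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    e = [math.exp(v - peak) for v in z]
    total = sum(e)
    return [v / total for v in e]


def mlp_af_score(pre_softmax):
    """Score used at test time: present output minus absent output."""
    present, absent = pre_softmax
    return present - absent


z = [1.3, -0.4]            # hypothetical linear outputs (present, absent)
p = softmax(z)             # posteriors that would be used during training
score = mlp_af_score(z)    # score with the softmax removed
log_odds = math.log(p[0] / p[1])
```

The two quantities `score` and `log_odds` agree up to rounding error.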

3.1.3 Score Fusion on TIMIT. This section describes two procedures that were
investigated for fusing the scores from the GMM- and MLP-based AF detectors [14]. Both
methods trained a fusion MLP for each AF to combine the scores. All fusion MLPs were trained
on the TIMIT corpus [15]. Fusion-1 combined the scores from the GMM- and MLP-based AF
detectors for a given AF to form the final score for that AF. For example, the fusion MLP for the
AF plosive used input features consisting of the output of the GMM-based plosive detector and
the MLP-based plosive detector. Fusion-2 combined the scores from all of the GMM- and
MLP-based AF detectors to form the final score for each AF; thus, the fusion MLP for each AF was
provided information about all AFs from two different classifiers.

All  fusion  MLPs  included  100  hidden  units  with  sigmoid  activation  functions,  and  used 
the  softmax  output  activation  function  for  training.  The  fusion  MLPs  included  a  context  window 
of nine; that is, the MLPs used the vectors at times t-4, t-3, ..., t+3, t+4 as input to classify the
vector  at  time  t.  As  in  Section  3.1.2,  the  output  activation  function  was  removed  prior  to  scoring 
and  the  score  for  the  AF  was  calculated  by  subtracting  the  output  of  the  absent  unit  from  the 
output  of  the  present  unit. 
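
The context window can be pictured as stacking each frame with its four left and four right neighbours. The sketch below does this for a toy sequence of two-dimensional score vectors (for example, one GMM score and one MLP score per frame, as in Fusion-1). Replicating the first and last frames at the utterance edges is an assumption; the report does not say how edges were handled, and the function name is hypothetical.

```python
def stack_context(frames, left=4, right=4):
    """Concatenate frames t-left, ..., t+right into one input vector per t."""
    n = len(frames)
    stacked = []
    for t in range(n):
        vec = []
        for k in range(t - left, t + right + 1):
            # Clamp the index so that edge frames are replicated.
            vec.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(vec)
    return stacked


# Toy sequence: 6 frames, 2 scores per frame.
frames = [[float(t), 10.0 * t] for t in range(6)]
stacked = stack_context(frames)
```

With two scores per frame and a window of nine frames, each stacked vector has 18 entries, and the centre frame occupies positions 8 and 9.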

Figure 3 shows the AF detection results obtained on the TIMIT test set. Each symbol
represents the average Equal Error Rate (EER) of the individual detectors for the AF groups
shown in Table 4. For the place and manner classifiers, the GMM-based detectors outperformed
the MLP-based detectors; for all other groups the MLPs yielded lower EERs. Fusion-1 yielded an
average decrease in EER of 4.7% absolute compared to the best GMM- or MLP-based detector.⁸
The best overall performance was obtained using the Fusion-2 procedure, which yielded an
average decrease in EER of 8.2% absolute compared to the best GMM- or MLP-based detector.

7 Available at http://www.icsi.berkeley.edu/Speech/qn.html
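
As a reminder of how an EER is obtained from detector scores, the sketch below sweeps a decision threshold over the observed scores and reports the error rate at the point where the false-acceptance and false-rejection rates are closest. This is a simple illustrative estimate on made-up scores, not the evaluation code used for Figure 3, and the function name is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: error rate where false accepts match false rejects."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A frame is accepted as "feature present" when its score >= thr.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up scores for frames where the AF is present and absent.
present = [2.0, 1.5, 0.3, 1.1, -0.2]
absent = [-1.0, 0.5, -0.7, 0.1, -1.4]
eer = equal_error_rate(present, absent)
```

With these toy scores, a threshold of 0.3 rejects one of five present frames and accepts one of five absent frames, so both error rates are 20%.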

[Figure 3: plot of average EER (in percent) for the GMM, MLP, Fusion-1, and Fusion-2 detectors across the AF groups Place, Manner, Voicing, Tongue Height, Tongue Fronting, and Lip Shape.]

Figure 3: Average EER of the AF Detectors on the TIMIT Test Set

The scores from the different AF detectors were used to form the feature set for an
HMM-based phoneme recognizer. First, a vector was formed using the scores from the
individual AF detectors. Next, these feature vectors were processed with a Karhunen-Loeve
Transformation (KLT) that was estimated on the TIMIT train set. The KLT was included to
decorrelate the individual AF scores so that diagonal covariance matrices could be used in the
HMMs. Lastly, delta features were appended. Monophone and triphone HMMs were created for
each feature set. All systems used three-state HMMs with 16 mixtures per state and diagonal
covariance matrices. Decoding was performed using a bigram phoneme LM that was estimated
from the TIMIT train set using the CMU-Cambridge Toolkit. The MFCC feature set described in
Section 3.1.1 was used for the baseline system.

Table 5 shows the Phoneme Error Rate (PER) obtained with each feature set on the
TIMIT test set. The features created using the scores from the GMM-based detectors yielded the
worst performance. An improvement in recognition performance was obtained using the scores
from the MLP-based detectors; however, the PER was still higher than that of the baseline
MFCC system. The Fusion-1 features outperformed both the GMM and MLP feature sets,
although an increase in performance over the baseline MFCC system was only obtained with
monophone models. The best performance was obtained using the Fusion-2 features.


8 The term best is used here to refer to the detector with the minimum EER for each AF.

Table 5: PER Obtained on the TIMIT Test Set

              MFCC     GMM      MLP      Fusion-1   Fusion-2
Monophones    39.5%    42.1%    39.9%    38.8%      35.8%
Triphones     35.9%    40.8%    38.4%    38.4%      35.6%

It is worth noting that the Fusion-2 monophone system yielded comparable performance
to  the  MFCC  triphone  system.  The  option  of  using  monophone  instead  of  triphone  models  with 
the  Fusion-2  features  can  be  a  significant  advantage  in  terms  of  decoding  time.  Excluding  the 
time  required  for  feature  extraction,  decoding  with  each  triphone  system  took  approximately  750 
minutes,  whereas  decoding  with  monophones  was  completed  in  about  20  minutes. 

3.1.4  Score  Fusion  on  CSLU.  This  section  discusses  AF  detection  on  the  CSLU  Multi- 
Language  corpus  [16].  Whereas  TIMIT  consists  of  lab-quality  recordings  of  read  speech  with 
broad  phonetic  coverage,  the  CSLU  corpus  includes  spontaneous  telephone  speech.  Thus,  these 
corpora  differ  in  speaking  style  (read  vs.  spontaneous),  channel  type  (close-talking  microphone 
vs.  telephone),  balance  of  phonetic  coverage,  and  sampling  rate. 

The WSJ1 and TIMIT corpora were first downsampled to 8 kHz and a second set of
Fusion-2 AF detectors was trained. Next, a set of Fusion-2 AF detectors was trained on the
CSLU corpus. All AF detectors were created using the same procedure described in Sections
3.1.1-3.1.3. It should be emphasized that all fusion MLPs used scores from GMM- and
MLP-based detectors trained on WSJ1 as input. Thus for the CSLU corpus, the base GMM- and
MLP-based detectors were used for a different speaking style (read vs. spontaneous) and channel
(close-talking microphone vs. telephone).

Figure 4 shows the EERs obtained with the Fusion-2 AF detectors. Each symbol type
represents a different train-test combination. For example, TIMIT8-CSLU shows the detection
performance obtained on the CSLU test set using Fusion-2 AF detectors trained on the TIMIT
corpus downsampled to 8 kHz. The individual symbols represent the EER of each AF detector,
where the feature numbers correspond to those given in Table 4. The best overall performance
was obtained on the TIMIT8-TIMIT8 condition. The average EER across all AFs was 8.6%.
When evaluated on the CSLU corpus, the fusion MLPs trained on TIMIT8 yielded an average
EER of 14.1%, which is an increase of 5.5% absolute compared to the results on TIMIT8. The average
EER of the Fusion-2 AF detectors trained and evaluated on CSLU was 11.5%.

[Figure 4: plot of EER per AF detector (feature numbers as in Table 4) for the train-test conditions TIMIT8-TIMIT8, CSLU-CSLU, and TIMIT8-CSLU.]

Figure 4: EER of the AF Detectors on the CSLU Test Set

From Figure 4 we can see that some of the AF detectors are more robust across both
corpora than others. For example, the increase in EER on TIMIT8-CSLU compared to
TIMIT8-TIMIT8 is less than 3.5% for the AFs labialvelar (3), lateral approximant (16), voiced (18),
vowel (20), close (21), near-back (30), and unrounded (33). The increase in EER is greater than
8.0% for the AFs alveolar (5), plosive (11), fricative (14), and voiceless (19). This suggests that
certain AFs are less affected by speaking style and channel type than other AFs.

As in Section 3.1.3, the scores from the fusion MLPs were used to form the feature set for
an HMM-based phoneme recognizer. Monophone and triphone HMMs were trained for each
feature set on the CSLU corpus. The monophone models included 32 mixtures per state, and the
triphone models included 12 mixtures per state. All systems used diagonal covariance matrices.
Decoding was performed using a trigram phoneme LM that was estimated from the CSLU train
partition using the CMU-Cambridge Toolkit. The MFCC feature set described in Section 3.1.1
was used for the baseline system. Table 6 shows the PER obtained with each feature set on the
CSLU test set. Both the TIMIT8 and CSLU Fusion-2 feature sets outperform the MFCC system.
The best performance was obtained with the CSLU Fusion-2 features: compared to MFCCs, the
PER was reduced by 2.0% absolute when decoding with either monophone or triphone models.

Table 6: PER Obtained on the CSLU Test Set

              MFCC     TIMIT8 Fusion-2    CSLU Fusion-2
Monophones    49.4%    48.6%              47.4%
Triphones     48.3%    47.4%              46.3%

3.2  AF  Detection  using  Multi-Class  MLPs 

This  section  discusses  how  multi-class  MLPs  were  used  to  create  English  AF  detectors. 
Section  3.2.1  describes  the  procedure  used  to  train  the  MLPs.  Section  3.2.2  presents  detection 
results  obtained  on  SVitchboard  and  describes  how  the  scores  from  the  MLPs  were  used  as  the 
feature  set  for  a  speech  recognizer.  Lastly,  Section  3.2.3  presents  results  obtained  on  Russian 
and  Dari.  Table  7  lists  the  features  that  were  used  to  describe  English  speech  sounds. 


Table 7: Features used to Describe English Speech Sounds [17]

Group           Feature Values
Place           alveolar, dental, labial, labiodental, lateral, none, postalveolar,
                rhotic, velar, silence
Degree          approximant, closure, flap, fricative, vowel, silence
Nasality        -, +, silence
Rounding        -, +, silence
Glottal State   aspirated, voiceless, voiced, silence
Vowel           aa, ae, ah, ao, aw1, aw2, ax, axr, ay1, ay2, eh, er, ey1, ey2, ih, iy, ix,
                ow1, ow2, oy1, oy2, uh, uw, none, silence
Height          high, low, mid, mid-high, mid-low, very-high, none, silence
Frontness       back, front, mid, mid-back, mid-front, none, silence

3.2.1 MLP-based AF Detectors. Two sets of MLPs were trained for each of the eight
AF groups shown in Table 7. The first set used MFCCs as input, and the second set used
Perceptual Linear Prediction (PLP) coefficients. The MFCC feature set was the same as
described in Section 3.1.1, except that both mean and variance normalization were applied on a
per-conversation side basis. The PLP feature set included 12 PLP cepstral coefficients, plus
energy, delta, and acceleration coefficients. As with the MFCCs, mean and variance
normalization were also applied.
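
Mean and variance normalization on a per-conversation side basis means that each side of a telephone conversation is normalized using statistics computed from that side alone. A minimal sketch follows; the side labels and data are made up, the function name is hypothetical, and leaving constant dimensions unscaled is an assumption added to avoid division by zero.

```python
import math


def mean_variance_normalize(frames):
    """Scale each dimension to zero mean and unit variance over the frames."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # A standard deviation of zero is replaced by 1.0 (assumed convention).
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) or 1.0
            for d in range(dim)]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


# Hypothetical features grouped by conversation side.
sides = {
    "conv01-A": [[1.0, 10.0], [3.0, 30.0]],
    "conv01-B": [[5.0, 2.0], [5.0, 4.0]],
}
normalized = {side: mean_variance_normalize(f) for side, f in sides.items()}
```

Because the statistics are computed per side, differences in channel and speaker level between the two sides of a call are removed independently.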

The MLPs were trained on the Fisher corpus [18, 19] using the ICSI QuickNet software
package. A context window of nine was used on the input layer, and the number of hidden units
for each MLP was chosen using the same procedure as described in [17]. Sigmoid activation
functions were used on the hidden layer. The number of output units for each MLP was set to the
number of feature values for that AF group, and the softmax function was used as the output
activation function.

3.2.2 AF Detection on SVitchboard. This section discusses AF detection on the
SVitchboard corpus [20]. SVitchboard is a small vocabulary corpus that includes conversational
telephone speech. A subset of 78 utterances includes AF alignments that were manually
produced [21]. Figure 5 shows the frame level accuracy of the MLPs trained on Fisher using
MFCC and PLP coefficients as input. For comparison purposes, the detectors from [17] were
also evaluated on these utterances. These detectors, referred to as Frankel in this document, use
the same network topology and PLP feature set as the MLPs described in Section 3.2.1. Overall,
similar performance is obtained with each set of MLPs. The largest difference in accuracy is
2.0% (Frankel vs. PLP degree). The lowest accuracy was 75.8% (MFCC place), and the highest
accuracy was 95.4% (Frankel nasality).
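
Frame level accuracy for a multi-class detector is the fraction of frames whose highest-scoring output unit matches the manual label. The sketch below illustrates this for a three-valued group such as nasality; the posteriors, labels, and class ordering are made up for illustration, and the function name is hypothetical.

```python
def frame_accuracy(posteriors, labels):
    """Fraction of frames whose arg-max output equals the reference label."""
    correct = sum(max(range(len(p)), key=p.__getitem__) == y
                  for p, y in zip(posteriors, labels))
    return correct / len(labels)


# Made-up outputs for four frames; assumed class order: [-, +, silence].
posteriors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]
labels = [0, 1, 2, 1]
accuracy = frame_accuracy(posteriors, labels)
```

Three of the four toy frames are classified correctly, giving an accuracy of 75%.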


[Figure 5: bar chart of frame level accuracy (in percent, up to 100%) for the Frankel, PLP, and MFCC detectors.]

Figure 5: Frame Level Accuracy of the MLP-based AF Detectors on the SVitchboard
Corpus

The  scores  from  the  MLPs  were  used  to  form  the  feature  set  for  an  HMM-based  speech 
recognizer.  First,  a  vector  was  formed  using  the  scores  from  the  individual  AF  detectors.  When 
computing  these  scores,  the  output  activation  function  was  removed  so  that  the  scores  more 
closely  approximated  a  Gaussian.  Next,  these  feature  vectors  were  processed  with  a  KLT  that 
was  estimated  on  the  SVitchboard  train  set,  and  the  top  26  dimensions  were  retained.  This 
feature  vector  was  appended  to  the  PLP  feature  set  described  in  Section  3.2.1  to  form  a  65 
dimensional  vector. 
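
The feature construction described above can be sketched as follows. This is a minimal illustration, not the code used in this work: it assumes the KLT is computed as an eigendecomposition of the training-set covariance, that the PLP vector has 39 dimensions (implied by 65 - 26), and it uses random numbers in place of real MLP scores. The number of AF scores per frame (46 here) and the function names are placeholders.

```python
import numpy as np

def estimate_klt(train_scores, n_keep=26):
    """Estimate a KLT basis from training-set AF scores.

    train_scores: (n_frames, n_scores) array of MLP outputs taken before
    the output activation function. Returns the training mean and the
    n_keep eigenvectors with the largest eigenvalues.
    """
    mean = train_scores.mean(axis=0)
    cov = np.cov(train_scores - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_keep]    # indices of the largest ones
    return mean, eigvecs[:, order]

def build_features(plp, scores, mean, basis):
    """Project the AF scores onto the KLT basis and append them to the PLP features."""
    projected = (scores - mean) @ basis
    return np.hstack([plp, projected])

rng = np.random.default_rng(0)
n_frames, n_scores = 500, 46                      # 46 is a placeholder value
train_scores = rng.normal(size=(n_frames, n_scores))
plp = rng.normal(size=(n_frames, 39))             # assumed 39-dimensional PLP vectors

mean, basis = estimate_klt(train_scores)
features = build_features(plp, train_scores, mean, basis)
print(features.shape)                             # (500, 65)
```

The transform is estimated once on the training partition and then reused unchanged for the test data, which matches the description of a KLT "estimated on the SVitchboard train set".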

Within-word triphone HMMs were trained for each feature set. All systems used three-state
HMMs with 12 mixtures per state and diagonal covariance matrices. Decoding was
performed using a bigram LM that was estimated from the SVitchboard train set using HTK.
The PLP features formed the baseline system. Table 8 shows the WER obtained with each
system. Incorporating the scores from the MLPs improved system performance in every case.
The best WER was obtained with the PLP system that incorporated the Frankel MLPs: compared
to the baseline PLP system, the WER dropped by 6.0% absolute (from 50.6% to 44.6%). Note
also that the MLP system with PLP input features performed better than the MLP system with
MFCC input features.

Table  8:  WER  on  the  SVitchboard  500  Word  Vocabulary  Task 


Features                       WER
PLP                            50.6%
PLP + Frankel                  44.6%
PLP + MLPs with PLP input      44.8%
PLP + MLPs with MFCC input     46.0%
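
For readers who want to reproduce the WER arithmetic, the following is a minimal sketch of the standard word-level edit-distance definition of WER. It is not the scoring tool used in this work (HTK provides its own), and the function names are illustrative. The last lines show that the 6.0% figure quoted for the PLP + Frankel system is an absolute difference between the two WER values in Table 8.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))              # distances for an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]

def wer(references, hypotheses):
    """Word error rate in percent over a list of utterances."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words

# Absolute vs. relative reduction for the two systems in Table 8.
baseline, best = 50.6, 44.6
absolute = baseline - best                        # about 6.0 points
relative = 100.0 * absolute / baseline            # about 11.9 percent
```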

3.2.3  Cross-Lingual AF Detection. The Frankel MLPs were also evaluated on the
GlobalPhone Russian and ARL Dari corpora. Whereas the Frankel MLPs were trained on English
conversational telephone speech, the GlobalPhone Russian and ARL Dari corpora consist of read
microphone speech. Thus, these corpora differ not only in language, but also in speaking style
(conversational vs. read), channel type (telephone vs. microphone), and sampling rate.

The GlobalPhone Russian and ARL Dari corpora were first downsampled to 8 kHz and
PLP features were extracted. These features were used as input to the Frankel MLPs, which
were evaluated with the output activation functions removed. Next, a vector was formed using
the scores from the individual AF detectors and processed with a KLT that was estimated on the
train partition of each language. The top 26 dimensions were retained and appended to the
MFCC feature set described in Section 2.1. This feature vector was used to train an HMM-based
speech recognizer for each language. The HMM systems were trained using the same procedure
described in Section 2.1 and decoding was performed using HDecode. The WER for each
language is shown in Table 9. Incorporating the Frankel MLPs reduced the WER by 1.6% on
Russian and 1.4% on Dari.


Table  9:  WER  on  Russian  and  Dari 


Language     MFCC      MFCC + Frankel
Russian      29.6%     28.0%
Dari         26.4%     25.0%

4.0  SPEECH SYNTHESIS IN 13 LANGUAGES


Speech  synthesis  systems  were  developed  for  13  different  languages  using  the  Hidden 
Markov  Model  (HMM)  Speech  Synthesis  ToolKit  (HTS).  This  chapter  describes  these  systems 
and  provides  an  overview  of  two  different  Graphical  User  Interfaces  (GUIs)  that  were  developed 
for creating new voices and synthesizing speech. Section 4.1 provides an overview of the
baseline  synthesis  systems.  Section  4.2  describes  three  English  and  two  Urdu  speech  synthesis 
systems  that  were  created  using  an  expanded  model  set.  Section  4.3  discusses  the  effect  of 
modifying  the  Minimum  Description  Length  (MDL)  control  factor.  Section  4.4  discusses 
speaker  clustering  and  adaptation  for  creating  English  and  Mandarin  voices.  Lastly,  Section  4.5 
provides  a  brief  overview  of  the  GUIs  that  were  developed. 

4.1  Baseline Synthesis Systems

This section discusses the baseline synthesis systems that were developed for Arabic
Iraqi, Croatian, Dari, English, French, German, Mandarin, Pashto, Russian, Spanish, Tagalog,
Turkish, and Urdu. A total of six different corpora were used to obtain coverage of all languages,
including the Spoken Language Communication and Translation System for Tactical Use
(TRANSTAC) corpus, GlobalPhone, ARL, CMU Arctic [22], HUB4, and LASER. All of these
corpora include speech data that were recorded with a 16 kHz sampling frequency. The CMU
Arctic database was developed specifically for speech synthesis and includes automatically
generated time-aligned transcriptions; all other corpora are only transcribed at the utterance
level. Phoneme alignments for the TRANSTAC, GlobalPhone, ARL, HUB4, and LASER
corpora were automatically generated using SONIC.

HMM-based speech synthesis systems were developed for each language using HTS-2.0
[23].^9 The feature set consisted of 25 Mel Cepstral Coefficients and the logarithm of the
fundamental frequency (F0). Prior to computing the features, the DC mean was removed from
each waveform file and amplitude normalization was applied to several of the corpora. The Mel
Cepstral coefficients were calculated using the Speech Signal Processing ToolKit (SPTK),^10 and
the F0 values were estimated using the ESPS method implemented in snack.^11 Delta and
acceleration coefficients were also included to form a 78-dimensional feature vector
((25 + 1) x 3 = 78).
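
The way static features grow into a 78-dimensional observation can be sketched as below. This is an illustration only: the regression windows (-0.5, 0, 0.5) and (1, -2, 1) are an assumption (a common choice in HMM-based synthesis recipes; the report does not state which windows were used), edge frames are simply repeated, and the special handling of unvoiced F0 frames in the MSD stream is ignored.

```python
DELTA_WIN = (-0.5, 0.0, 0.5)   # assumed first-order regression window
ACCEL_WIN = (1.0, -2.0, 1.0)   # assumed second-order regression window

def apply_window(frames, window):
    """Apply a 3-tap window along the time axis, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, n - 1)]
        out.append([window[0] * p + window[1] * c + window[2] * x
                    for p, c, x in zip(prev_f, frames[t], next_f)])
    return out

def add_dynamic_features(static):
    """Concatenate static, delta, and acceleration coefficients per frame."""
    delta = apply_window(static, DELTA_WIN)
    accel = apply_window(static, ACCEL_WIN)
    return [s + d + a for s, d, a in zip(static, delta, accel)]

# Toy input: 5 frames of 26 static values (25 mel-cepstra + log F0),
# each frame filled with its own index so the deltas are easy to check.
static = [[float(t)] * 26 for t in range(5)]
observations = add_dynamic_features(static)
print(len(observations[0]))    # 78
```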

Cross-word  triphone  Multi-Space  probability  Distribution  (MSD)-HMMs  [24]  were 
trained  for  each  language.  All  MSD-HMMs  included  five  states  with  diagonal  covariance 
matrices,  and  the  state  durations  for  each  triphone  were  modeled  by  a  Gaussian  distribution. 
Decision tree based clustering was applied to the Mel Cepstrum, F0, and state duration
distributions  independently;  thus,  two  decision  trees  were  created  for  each  MSD-HMM  state, 
plus  an  additional  decision  tree  for  the  state  duration  model.  Table  10  lists  the  voices  that  were 
created  for  each  language,  the  corpora  used,  the  number  of  speakers  used  to  train  the  voices,  and 
the  total  amount  of  training  data  used  to  develop  the  synthesizers. 


9  Available  at  http://hts.sp.nitech.ac.jp 

10  Available  at  http://sp-tk.sourceforge.net 

11  Available at http://www.speech.kth.se/snack

Table 10: Overview of Voices Created


Language       Corpus        Voices           Speaker Count   Hours
Arabic Iraqi   TRANSTAC      Speaker1         370             30
                             Speaker2         10              3
Croatian       GlobalPhone   Male             32              5
                             Female           48              7
Dari           ARL           Male1            15              2
                             Male2            15              2
English        CMU Arctic    Male             4               3
                             Female           2               2
                             SLT              1               1
French         GlobalPhone   Male             39              10
                             Female           40              11
German         GlobalPhone   Male             60              13
                             Female           5               1
Mandarin       HUB4          Male             10              2
                             Wang Jianchuan   1               1
                             Female           8               2
                             Fang Jing        1               1
Mandarin       GlobalPhone   Male             15              4
Pashto         LASER         Random1          10              1
                             Random2          10              1
Russian        GlobalPhone   Male             49              9
                             Female           44              9
Spanish        GlobalPhone   Male             38              8
                             Female           46              10
Tagalog        LASER         Male             20              2
                             Female           28              4
Turkish        GlobalPhone   Male             24              4
                             Female           60              10
Urdu           LASER         Male             76              17
                             Female           84              20

4.2  Full-Context Models

This section discusses the English and Urdu speech synthesizers that were created using
an expanded model set. As mentioned in Section 4.1, the baseline synthesis systems for each
language used cross-word triphone models. Although these models produce intelligible speech,
numerous other contextual factors can affect the overall prosody and naturalness of
speech. To incorporate these contextual factors, the triphone labels for each speech
database have to be expanded to include all features of interest. For example, the labels supplied
with the HTS demos for the CMU Arctic database consist of 53 different contextual features,
including syllable, accent, stress, part-of-speech, word, and phrase information. These labels are
then used to define the acoustic models; thus, a separate MSD-HMM is trained for each phoneme
that appears in a different context. Note that this can result in a very large model set prior to
clustering. For example, the training data for the English SLT voice includes 38866 phoneme
instances: using cross-word triphone labels requires 9480 unique MSD-HMMs, whereas using
the expanded label set requires 38765 unique MSD-HMMs. An expanded set of labels was
derived for Urdu that included syllable, word, and phrase information. These labels included a
total of 31 different contextual features. Syllable information was explicitly marked in the
pronunciation lexicon, and phrase information was derived by assigning a break wherever
silence was labeled. Table 11 lists the expanded label set derived for Urdu.

Table 11: Expanded Label Set for Urdu


p1   the phoneme identity before the previous phoneme
p2   the previous phoneme identity
p3   the current phoneme identity
p4   the next phoneme identity
p5   the phoneme after the next phoneme identity
p6   position of the current phoneme in the current syllable (forward)
p7   position of the current phoneme in the current syllable (backward)

a1   the number of phonemes in the previous syllable

b1   the number of phonemes in the current syllable
b2   position of the current syllable in the current word (forward)
b3   position of the current syllable in the current word (backward)
b4   position of the current syllable in the current phrase (forward)
b5   position of the current syllable in the current phrase (backward)
b6   name of the vowel of the current syllable

c1   the number of phonemes in the next syllable

d1   the number of syllables in the previous word

e1   the number of syllables in the current word
e2   position of the current word in the current phrase (forward)
e3   position of the current word in the current phrase (backward)

f1   the number of syllables in the next word

g1   the number of syllables in the previous phrase
g2   the number of words in the previous phrase

h1   the number of syllables in the current phrase
h2   the number of words in the current phrase
h3   position of the current phrase in this utterance (forward)
h4   position of the current phrase in this utterance (backward)

i1   the number of syllables in the next phrase
i2   the number of words in the next phrase

j1   the number of syllables in this utterance
j2   the number of words in this utterance
j3   the number of phrases in this utterance
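
To illustrate why an expanded label set inflates the number of distinct models before clustering, the sketch below derives only the p1 to p7 features of Table 11 for a toy utterance and compares the number of unique triphone contexts with the number of unique expanded contexts. The data structure, the padding symbol "x", and the toy phoneme sequence are hypothetical and are not taken from the Urdu lexicon or from the HTS label format.

```python
def phoneme_contexts(words):
    """Compute the p1-p7 features for every phoneme in an utterance.

    words: list of words, each a list of syllables, each a list of phonemes.
    """
    flat = []  # (phoneme, forward position in syllable, backward position)
    for word in words:
        for syllable in word:
            for k, phoneme in enumerate(syllable):
                flat.append((phoneme, k + 1, len(syllable) - k))

    pad = "x"  # hypothetical symbol for "no phoneme" at the utterance edges
    names = [pad, pad] + [item[0] for item in flat] + [pad, pad]

    contexts = []
    for i, (_, forward, backward) in enumerate(flat):
        contexts.append({
            "p1": names[i], "p2": names[i + 1], "p3": names[i + 2],
            "p4": names[i + 3], "p5": names[i + 4],
            "p6": forward, "p7": backward,
        })
    return contexts

# Toy utterance: two words, each with two syllables.
utterance = [[["b", "a"], ["b", "a"]], [["b", "a", "b"], ["a"]]]
contexts = phoneme_contexts(utterance)

triphones = {(c["p2"], c["p3"], c["p4"]) for c in contexts}
expanded = {tuple(c.values()) for c in contexts}
print(len(contexts), len(triphones), len(expanded))   # 8 4 8
```

Even in this toy case every phoneme instance ends up with its own expanded context, which mirrors the SLT figures quoted in this section (38866 instances and 38765 unique expanded-label models).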

All three English voices and both Urdu voices were retrained using the
expanded labels. Overall, there was not a substantial improvement in voice quality. This may be
due to the limited amount of speech data available to train different models for each phoneme in
a particular context.

4.3  MDL Control Factor


Decision tree clustering in HTS is based on the MDL criterion [25]. The MDL criterion
is used both for selecting the questions when splitting nodes and for deciding when to stop
growing the decision trees. A control factor λ is used to weight the penalty that the MDL
criterion imposes for model complexity. As λ is increased, the penalty for a large model becomes
larger and the stopping criterion is met sooner (thus producing a decision tree with fewer
leaves). The English male and female voices described in Section 4.2 were retrained using
λ = 1.0, 0.7, and 0.4. The total number of leaves obtained for each λ is shown in Figure 6. As λ
is increased, the total number of leaves for each of the decision trees decreases.

Figure 6: Total Number of Leaves Generated for the English Male and Female Voices when
Modifying the MDL Control Factor λ
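
The effect of the control factor on a single split decision can be sketched as follows. This is a simplified reading of the MDL criterion in [25], not the HTS implementation: it assumes one diagonal-covariance Gaussian per node, a penalty of (control factor) x K x log(W) per split, where K is the feature dimension and W is the total occupancy at the root, and made-up occupancies and variances. The parameter name `lam` and both function names are illustrative.

```python
import math

def gaussian_loglik(occupancy, variances):
    """Log-likelihood of the data at a node modelled by one diagonal Gaussian
    whose mean and variances are the maximum-likelihood estimates."""
    k = len(variances)
    log_det = sum(math.log(v) for v in variances)
    return -0.5 * occupancy * (k * (1.0 + math.log(2.0 * math.pi)) + log_det)

def mdl_change(parent, yes, no, total_occupancy, lam=1.0):
    """Change in description length if `parent` is split into `yes` and `no`.

    Each node is a tuple (occupancy, per-dimension variances).
    A negative value means the split reduces the description length.
    """
    gain = gaussian_loglik(*yes) + gaussian_loglik(*no) - gaussian_loglik(*parent)
    k = len(parent[1])
    penalty = lam * k * math.log(total_occupancy)
    return penalty - gain

# Toy split: the children are only slightly tighter than the parent.
parent = (1000.0, [1.0, 1.0])
yes = (500.0, [0.99, 0.99])
no = (500.0, [0.99, 0.99])

for lam in (1.0, 0.7, 0.4):
    change = mdl_change(parent, yes, no, total_occupancy=1000.0, lam=lam)
    print(lam, "split" if change < 0 else "stop")
```

With these made-up numbers the split is rejected at 1.0 but accepted at 0.7 and 0.4, which is the mechanism behind the larger trees obtained with smaller control factors.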
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4.4  Speaker  Clustering  and  Adaptation 

This section discusses how speaker clustering and adaptation were used to create voices
for Mandarin and English. A total of 52 different Mandarin speech synthesis systems were
trained on the GlobalPhone corpus using groups of three or more speakers. The speaker groups
were defined based on the individual speakers' F0 values and/or speaker recognition scores. Two
additional voices were also created on the HUB4 Mandarin corpus by adapting the Male voice
using speech from Wang Jianchuan, and adapting the Female voice using speech from Fang Jing.
The adaptation transforms were estimated using Constrained Maximum Likelihood Linear
Regression (CMLLR).

A total of 53 English speech synthesis systems were trained on Phase I of the Wall Street
Journal (WSJ0) corpus [26] and WSJ1. These systems were developed using HTS-2.1.^13
Cross-word triphone MSD Hidden Semi-Markov Models (HSMMs) [27] were created for each voice
using the same feature set as described in Section 4.1. As with the other corpora, the phoneme
alignments were automatically generated using SONIC. The first 25 voices were created using
groups of three or more speakers. The speaker groups were defined based on speaker
recognition scores: 19 groups of speakers were derived from a speaker confusion matrix, and the
remaining six groups were derived using a spectral clustering algorithm [28]. Next, one set of
MSD-HSMMs was trained using 3600 utterances from nine different speakers (~400 utterances
from each speaker), and a second set of MSD-HSMMs was trained using 3502 utterances from
20 different speakers (~200 utterances from each speaker). These models were adapted using
speech from one of 22 different speakers to create the remaining 28 voices. Adaptation was
performed using Constrained Structural Maximum-A-Posteriori Linear Regression (CSMAPLR),
followed by MAP adaptation [29].
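
As a rough illustration of how speaker groups might be derived from recognition scores, the sketch below performs a two-way spectral partition of a speaker confusion matrix using the second eigenvector of the graph Laplacian. It is a generic textbook form of spectral clustering and is not claimed to match the algorithm of [28] or the procedure used to define the groups in this work; the confusion values and the function name are invented for the example.

```python
import numpy as np

def spectral_bipartition(confusion):
    """Split speakers into two groups from a pairwise confusion matrix.

    confusion[i, j] is treated as a similarity between speakers i and j.
    Returns a boolean array; speakers with the same value share a group.
    """
    affinity = 0.5 * (confusion + confusion.T)    # make the matrix symmetric
    np.fill_diagonal(affinity, 0.0)               # ignore self-similarity
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity        # unnormalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                       # second-smallest eigenvalue
    return fiedler >= 0.0

# Hypothetical scores: speakers 0-2 are confusable with each other,
# as are speakers 3-5, with little confusion across the two sets.
confusion = np.full((6, 6), 0.05)
confusion[:3, :3] = 0.9
confusion[3:, 3:] = 0.9

groups = spectral_bipartition(confusion)
print(groups)
```

Which group is labeled True depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful. More than two groups would require additional eigenvectors and a clustering step such as k-means.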

4.5  Synthesis  GUIs 

This  section  describes  two  GUIs  that  were  developed  for  training  and  evaluating  speech 
synthesizers. The first interface can be used to set up a speech synthesis experiment. This
program  allows  the  user  to  choose  a  set  of  speakers  to  train  the  voice  and  adjust  system 
parameters  related  to  speech  analysis,  model  settings,  and  synthesis.  Figure  7  shows  two 
instances  of  the  interface:  the  top  one  shows  the  speaker  selection  dialog,  and  the  bottom  one 
shows  the  spectrum  analysis  dialog.  Once  all  configuration  options  have  been  specified,  this 
program  creates  the  makefiles  for  training  and  evaluating  the  system. 


12  The speaker recognition experiments, F0 analysis, and speaker cluster definitions described in this section
(except for those derived using the spectral clustering algorithm) were generated by Mr. Eric Hansen.

13  Available at http://hts.sp.nitech.ac.jp

Figure 7: GUI for Configuring a Speech Synthesis Experiment; the speaker selection
dialog is shown on top, and the spectrum analysis dialog is shown on the bottom

The second interface can be used to synthesize speech, modify pronunciations, and create
new voices by modifying the synthesis parameters. The text to synthesize can be entered from
the keyboard or read from a text file, and the pronunciations can be modified and saved on
a per-speaker basis. The following synthesis parameters can be modified: all-pass constant,
post-filtering coefficient, speech speed rate, multiplicative and additive constants for F0,
voiced/unvoiced threshold, spectrum and F0 global variance weights, amplitude normalization
constant, maximum state duration variance, and model interpolation coefficients. Figure 8 shows
the main interface and pronunciation editor.

Figure 8: GUI for Synthesizing Speech; the main interface is shown
on top, and the pronunciation editor is shown on the bottom

5.0  CONCLUSIONS AND RECOMMENDATIONS


This document summarized work completed by General Dynamics during the period
August 2004 to February 2009. Speech recognition systems were developed for 15 different
languages using HTK. Three methods were investigated for improving the performance of these
systems: VTLN, SAT, and the ROVER technique. Applying VTLN yielded improvements of
1.0% on English, 1.7% on Mandarin, and 0.3% on Russian. SAT reduced the WER by 4.5% on
Russian and 3.1% on Dari. The ROVER technique yielded improvements in system
performance of up to 2.4%. Given the substantial gains in system performance obtained with
SAT, recommendations for future work include evaluating SAT across all languages,
investigating how much speech data is needed from a single speaker to obtain an improvement in
performance, and implementing an automatic method for detecting speaker changes and
clustering speakers so that SAT can be applied to data where the speaker boundaries are
unknown (i.e., broadcast news).

AF detectors were developed for English using GMMs, two-class MLPs, fusion MLPs,
and multi-class MLPs. The outputs of the detectors were used to form feature sets for HMM-based
phoneme and word recognizers. On TIMIT, the Fusion-2 feature set yielded an
improvement in PER of 3.7% compared to an MFCC system when decoding with monophones.
On CSLU, the Fusion-2 features yielded improvements of 2.0% PER compared to MFCCs when
decoding with either monophone or triphone models. On SVitchboard, appending the scores
from the multi-class MLPs to PLP features yielded an improvement in WER of 6.0%. Finally,
appending the scores from the English multi-class MLPs to MFCC features reduced the WER by
1.6% on Russian and 1.4% on Dari. Recommendations for future work include evaluating the
English AF detectors across all languages, investigating methods for adapting the multi-class
MLPs to different languages, and using alternative acoustic features for input to the MLPs.

Speech synthesis systems were developed for 13 different languages using HTS. Four
methods were investigated for modifying these systems: expanding the model set to include
additional contextual features, changing the MDL control factor, using speaker recognition
scores and/or F0 values for grouping speakers to train voices, and applying speaker adaptation.
Two GUIs were also developed for training and evaluating the speech synthesizers.
Recommendations for future work include investigating how much speech data is needed to
obtain an improvement when using an expanded model set, determining how much speech data
is needed for speaker adaptation, and investigating the effects of using different speaker
groupings to train the base model that is used for adaptation.

LIST OF ACRONYMS & GLOSSARY


AF            Articulatory Feature
AFRL          Air Force Research Laboratory
ARL           Army Research Laboratory
CALLHOME      a speech corpus of unscripted telephone conversations
CMLLR         Constrained Maximum Likelihood Linear Regression
CMU           Carnegie Mellon University
CSMAPLR       Constrained Structural Maximum-A-Posteriori Linear Regression
EER           Equal Error Rate
F0            fundamental frequency
Fisher        a speech corpus of telephone conversations
GMM           Gaussian Mixture Model
GUI           Graphical User Interface
GlobalPhone   a multilingual text and speech database
HMM           Hidden Markov Model
HSMM          Hidden Semi-Markov Model
HTK           Cambridge University Hidden Markov Model ToolKit
HTS           Hidden Markov Model based Speech Synthesis ToolKit
HUB4          a broadcast news speech corpus
HDecode       Cambridge University large vocabulary continuous speech recognizer
ICSI          International Computer Science Institute
IPA           International Phonetic Alphabet
Julius        an open source large vocabulary continuous speech recognition engine
KLT           Karhunen-Loève Transformation
LASER         Language and Speech Exploitation Resources
LM            Language Model
MAP           Maximum-A-Posteriori
MDL           Minimum Description Length
MFCC          Mel-Frequency Cepstral Coefficient
MLP           Multi-Layer Perceptron
MSD           Multi-Space Probability Distribution
PER           Phoneme Error Rate
PLP           Perceptual Linear Prediction
ROVER         Recognizer Output Voting Error Reduction
SAT           Speaker Adaptive Training
SI            Speaker Independent
SONIC         University of Colorado continuous speech recognizer
SPTK          Speech Signal Processing Toolkit
SRover        University of Brno implementation of recognizer output voting error reduction
TDT4          phase four of the Topic Detection and Tracking corpus
TRANSTAC      Spoken Language Communication and Translation System for Tactical Use
VTLN          Vocal Tract Length Normalization
WER           Word Error Rate
WSJ0          phase one of the Wall Street Journal corpus
WSJ1          phase two of the Wall Street Journal corpus